Robust Speech Recognition via Anchor Word Representations
Authors
Abstract
A challenge for speech recognition on voice-controlled household devices, such as the Amazon Echo or Google Home, is robustness against interfering background speech. In this far-field speech recognition setting, another person or a nearby media device can produce background speech that interferes with the device-directed speech. We expand on our previous work on device-directed speech detection in the far-field setting and introduce two approaches for robust acoustic modeling. Both methods use an anchor word taken from the device-directed speech. The first method applies a simple yet effective normalization of the acoustic features by subtracting the mean computed over the anchor word. The second method uses an encoder network to project the anchor word onto a fixed-size embedding, which serves as an additional input to the acoustic model. The encoder network and acoustic model are jointly trained. Results on an in-house dataset show that, in the presence of background speech, the proposed approaches achieve up to 35% relative word error rate reduction.
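The first method described in the abstract, subtracting a mean computed over the anchor word from the acoustic features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the list-of-frames feature layout, and the assumption that the anchor word's frame boundaries are already known (for example from a wake-word detector) are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Frame = List[float]


def anchor_mean(features: Sequence[Sequence[float]],
                anchor_start: int, anchor_end: int) -> Frame:
    """Per-dimension mean over the anchor frames [anchor_start, anchor_end).

    Assumption: the anchor word boundaries are given as frame indices.
    """
    anchor = features[anchor_start:anchor_end]
    if not anchor:
        raise ValueError("anchor segment is empty")
    dim = len(anchor[0])
    return [sum(frame[d] for frame in anchor) / len(anchor) for d in range(dim)]


def normalize_by_anchor(features: Sequence[Sequence[float]],
                        anchor_start: int, anchor_end: int) -> List[Frame]:
    """Subtract the anchor-word mean from every frame of the utterance."""
    mean = anchor_mean(features, anchor_start, anchor_end)
    return [[x - m for x, m in zip(frame, mean)] for frame in features]


# Toy utterance: 4 frames of 2-dimensional features (illustrative values only).
utterance = [
    [1.0, 4.0],  # anchor word frame
    [3.0, 6.0],  # anchor word frame
    [5.0, 2.0],  # rest of the utterance
    [2.0, 5.0],  # rest of the utterance
]

# Hypothetical anchor boundaries: frames 0 and 1 belong to the anchor word.
normalized = normalize_by_anchor(utterance, 0, 2)
```

Because the statistics come only from the anchor word, the whole utterance is shifted relative to the device-directed speaker's features rather than relative to a mean that may be contaminated by background speech. Real systems would typically operate on log filterbank or similar features with many more dimensions.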
Similar resources
An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition
Convolutional Neural Networks (CNNs) have shown their performance in speech recognition systems for feature extraction and acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. A Convolutive Bottleneck Network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck fea...
Improving the performance of MFCC for Persian robust speech recognition
Mel-frequency cepstral coefficients (MFCCs) are the most widely used features in speech recognition, but they are very sensitive to noise. In this paper, to achieve satisfactory performance in Automatic Speech Recognition (ASR) applications, we introduce a new noise-robust set of MFCC vectors estimated through the following steps. First, spectral mean normalization is a pre-processing step that applies to t...
Headstart for speech segmentation: a neural signature for the anchor word effect.
Learning a new language is an incremental process that builds upon previously acquired information. To shed light on the mechanisms of this incremental process, we studied the on-line neurophysiological correlates of the so-called anchor word effect where newly learned words facilitate segmentation of novel words from continuous speech. Higher segmentation performance was observed for speech st...
Combining acoustic and articulatory feature information for robust speech recognition
The idea of using articulatory representations for automatic speech recognition (ASR) continues to attract much attention in the speech community. Representations which are grouped under the label "articulatory" include articulatory parameters derived by means of acoustic-articulatory transformations (inverse filtering), direct physical measurements or classification scores for pseudo-articul...
Continuous speech recognition using phone-based anchor point detection and diphone-based dp-matching
This paper deals with acoustic-phonetic decoding for CSR. There are two different processing modules depending on the steady or transient nature of the speech input. First, the steady-state speech processing module, called phone-based anchor point detection, performs some preprocessing, allowing the selection of only a subset of the vocabulary under consideration. Secondly a general processing...